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1 "*~i ■ In this paper we introduce new estimators of the coefficient functions in the varying coeffi- 

cient regression model. The proposed estimators are obtained by projecting the vector of the 
full-dimensional kernel-weighted local polynomial estimators of the coefficient functions onto 
a Hilbert space with a suitable norm. We provide a backfitting algorithm to compute the es- 
timators. We show that the algorithm converges at a geometric rate under weak conditions. 
We derive the asymptotic distributions of the estimators and show that the estimators have 
the oracle properties. This is done for the general order of local polynomial fitting and for the 
estimation of the derivatives of the coefficient functions, as well as the coefficient functions 
themselves. The estimators turn out to have several theoretical and numerical advantages over 
the marginal integration estimators studied by Yang, Park, Xue and Hardle [J. Arner. Statist. 
Assoc. 101 (2006) 1212-1227]. 
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1. Introduction 



Vh ■ In this paper we consider a varying coefficient regression model proposed by Hastie and 

Tibshirani [12] and studied by Yang, Park, Xue and Hardle [24]. The model takes the 
form Y l = m(X\Z l )+a(X\Z l )e\ i = l,...,n, where 

d 
m(X,Z)=^m,(Y,)Z„ (1.1) 

i=i 

rrij are unknown coefficient functions, X* = (X{, . . . , X d ) T and Z* = (Z\, . . . , Z^) T are ob- 
served vectors of covariates, and e l are the error variables such that E(e l \X. 1 , Z J ) = and 
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2 Y.K. Lee, E. Mammen and B. U. Park 

var(e*|X l ,Z J ) = 1. We assume that (X l ,Z 4 ,Y*) for 1 < i < n are independent and iden- 
tically distributed. The model is simple in structure and easily interpreted, yet flexible, 
since the dependence of the response variable on the covariates is modeled in a nonpara- 
metric way. The model is different from the functional coefficient model of Chen and 
Tsay [4], Fan and Zhang [8], Cai, Fan and Li [2] and Cai, Fan and Yao [3], where rrij 
are functions of a single variable, that is, m{X % , Z 8 ) = X),=i rn j{-^ 1 )^)- Fitting the latter 
model is much simpler than the model (1.1) since it involves only a univariate smoothing 
across the single variable X. 

To fit the model (1.1), we may apply the idea of local polynomial smoothing. To 
illustrate the difficulty in fitting the model, suppose that we employ local constant fitting 
so that wc minimize 



E 



d 



T 2 

K h {x u X{)---K h {x d ,X d ) 



with respect to 8j, 1 < j < d, to get estimators of m,j(xj), 1 < j < d, where Kh is a kernel 
function. For each coefficient rrij , this yields an estimator which is a function of not only Xj 
but also other variables Xk, k ^ j. The marginal integration method, proposed and stud- 
ied by Yang et al. [24], is simply to take the average of 6j{X{,. . . 7 X*_ 1 ,xj,Xj +1 , . . .,X d ) 
in order to eliminate the dependence on the other variables. 

In this paper we propose a new method for fitting the model (1.1). The proposed 
method is to project the vector of the full-dimensional kernel-weighted local polynomial 
estimators (9j , 1 < j < d, in the above, in the case of local constant fitting) onto a space of 
vectors of functions /j : R — >• R, 1 < j <d, with a suitable norm. Projection-type estima- 
tion has been studied in other structured nonpar ametric regression models. For example, 
the smooth backfitting method was proposed by Mammen, Linton and Nielsen [17] to 
fit additive regression models. The same idea was applied to quasi-likclihood additive 
regression by Yu, Park and Mammen [25] and to additive quantile regression by Lee, 
Mammen and Park [16]. Some nonparametric time series models have been proposed 
with unobserved factors Zj that do not depend on the individual but on time; see, for 
example, Connor, Linton and Hagmann [5], Fengler, Hardle and Mammen [9] and Park, 
Mammen, Hardle and Borak [21]. In these papers it has been shown that one can also 
proceed asymptotically in the models under consideration, as if the factors would have 
been observed. We note that the current problem does not fit into the framework of the 
above papers but requires a different treatment. In particular, in the model (1.1), the 
functions rrij are not additive components of the regression function, but they are the 
coefficients of Zj . For a treatment of our model we have to exclude the case of constant 
Zj = 1. In the case of constant Zj, model (1.1) reduces to the additive model. The key 
element in the derivation of the theory for our model is to embed the vector of the co- 
efficient functions into an additive space of vectors of univariate functions and then to 
endow the space with a norm where the covariates Zj enter with kernel weights. 

As far as we know, the marginal integration method has been the only method to 
fit the model (1-1). It is widely accepted that the marginal integration method suffers 
from the curse of dimensionality. Inspired by Fan, Hardle and Mammen [6] and others, 
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Yang et al. [24] tried to avoid the dimensionality problem by using two different types of 
kernels and bandwidths. To be more specific, consider estimation of rrij for a particular j . 
The method then uses a kernel, say L, and bandwidths, say bk, for the directions of Xk 
(k =/= j), which are different from a kernel K and a bandwidth hj for the direction of Xj . 
By choosing bk <C hj and taking a higher order kernel L, we can achieve the univariate 
optimal rate of convergence for the resulting estimator of rrij . One of the main difficulties 
with the marginal integration method is that there is no formula available for the optimal 
choice of the secondary bandwidths bk- Also, the performance of the method depends 
crucially on the choice of the secondary bandwidths bk, as observed in our numerical 
study; see Section 5. Furthermore, the method involves estimation of a full-dimensional 
regression estimator, which requires inversion of a full-dimensional [(tt + l)d] x [(it + l)d] 
smoothing matrix, where it is the order of local polynomial fitting. This means that the 
method may break down in practice in high dimension. 

On the contrary, the proposed method may use bandwidths of the same order for 
all directions to achieve the univariate optimal rate of convergence, and we derive for- 
mulas for the optimal bandwidths. The method requires only one- and two-dimensional 
smoothing and inversion of a (tt + 1 ) x (tt + 1) matrix which is computed by a single- 
dimensional local smoothing. Thus, the proposed method does not suffer from the curse 
of dimensionality in practice as well as in theory. We show that the method has the oracle 
properties, meaning that the proposed estimator of rrij for each j has the same first-order 
asymptotic properties as the oracle (infeasible) estimator of rrij that uses the knowledge 
of all other coefficient functions m^ , k ^ j. We develop the theory for the method with 
local polynomial fitting of general order tt > 0. Thus, the theory gives the asymptotic 

(k) 

distributions of the estimators of rrij, as well as their derivatives rrv- , 1 < k < tt. 

There have been several works on a related varying coefficient model where the co- 
efficients are time- varying functions. These include Hoover, Rice, Wu and Yang [13], 
Huang, Wu and Zhou [14, 15], Wang, Li and Huang [23] and Noh and Park [19]. The ker- 
nel method of fitting this model is quite different from, and simpler than, the method of 
fitting our model (1.1), since the former involves only a univariate smoothing across time. 
Recently, Park, Hwang and Park [20] considered a testing problem for the model (1.1) 
based on the marginal integration method. 

This paper is organized as follows. In the next section, we describe the proposed method 
with local constant fitting and then, in Section 3, we give its theoretical properties. 
Section 4 is devoted to the extension of the method and theory to local polynomial 
fitting of general order. In Section 5 we present the results of our numerical study. In 
Section 6 we apply the proposed method to Boston Housing Data. Technical details are 
contained in the Appendix. 

2. The method with local constant fitting 

Although our main focus is to introduce the method with local polynomial fitting and to 
develop its general theory, we start with local constant fitting since the method is better 
understood in the latter setting. Let Y be the response variable, and X = (X\, . . . , Xd) T 
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and Z = (Zi,. . . ,Zd) T be the covariate vectors of dimension d. Let {(X 1 , Z l ,Y l )}2 =1 
be a random sample drawn from (X, Zi,Y). Assume that the density p of X is sup- 
ported on [0, l] d . To estimate the coefficient functions raj in the model (1.1), we consider 
a 'smoothed' squared error loss. Similar ideas were adopted for additive regression by 
Mammen et al. [17] and for quasi-likelihood additive regression by Yu et al. [25]. 

Let K be a nonnegative function, called a base kernel. To define a smoothed squared 
error loss, we use a boundary corrected kernel, as in Mammen et al. [17] and Yu et al. [25], 
which is defined by 



K g (u,v) 



K 



9 



Aw 



9 



/(«,«€ [0,1]). 



Suppose that we use different bandwidths for different directions. Let h = (hi, . . . , hd) be 
the bandwidth vector. For simplicity, we focus on the case where we use a product kernel 
of the form A'h(u, v) = fli=i Khj (uj, Vj). We may use a more general multivariate kernel, 
but this would require more involved notation and technical arguments. The proposed 



estimator of m = (mi , 
minimizer of 



■ ,m d ) 



, where rrij(x) = mj(xj), is defined to be the 



~ n 

L(f) = jn- 1 Y, 



-I 2 



A h (x,X l )dx 



**-£/i(*i)3 

3=1 

over f = (fi,..-,fd) with L(f) < oc. Here and hereafter, integration over x is on 
[0,l] d . Define M(x) = 71" 1 J27=i A h (x,X*)Z l Z* T . Then, L(f) < oo is equivalent to 
Jf(x) T M(x)f(x) dx < oo. The function space that arises in the minimization problem 
is 

H(M) = {f S L 2 (M) : /j(x) = gj(xj) for a function g 3 ■ : M ->■ R, 1 < j < d}, 
where L-2(M.) denotes a class of function vectors f defined by 

L 2 (M) = if :f(x) = (/i(x),...,/ d (x)) T for some functions fj-.^^R 



and / f (x) T M(x)f (x) dx < oo 



The spaces L2QVI) and T-L(M.) are Hilbert spaces equipped with a (semi)norm 
defined by 



IM' 



l f HL 



f(x) T M(x)f(x)dx. 



Let M(x) =£'(ZZ T |X = x)p(x). Since ||f||^ converges to 



|f||^^y'f(x) T M(x)f(x)dx 
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in probability under certain conditions, the corresponding Hilbert spaces in the limit 
are L2CM.) and "H(M), which are defined as L2(M.) and 'H(M), respectively, with M 
being replaced by M. Here, we note that || ■ ||m becomes a norm if we assume that 

f (X) T Z = almost surely implies f = 0. (2.1) 

In fact, the assumption (2.1) is known to be a sufficient condition for avoiding concurvity, 
as termed by Hastie and Tibshirani [11], an analog of collincarity in linear models. If the 
assumption does not hold, then the m,j are not identifiable. This is because, for f such 
that f (X) T Z = almost surely, we have 

E(Y\X, Z) = m(X) T Z = [m(X) + f (X)] T Z. 

The assumption (2.1) is satisfied if we assume that the smallest eigenvalue of E(ZZ T \ 
X = x) is bounded away from zero on [0, l] d . 
For f G H(M), we obtain 

/n 
n- 1 Y,[Y l - rii(x) T Z I ] 2 J ftT h (x, X 1 ) dx 

i— 1 

+ Am(x) - f(x)] T M(x)[rh(x) - f(x)] dx, 
where rh is the minimizer of L(f) over f S L2(M). It is given explicitly as 

n 

rh(x) = Mtx)- 1 ™- 1 J2 Z'y'ifhfc, X*). (2.2) 

Thus, the proposed estimator rh = (mi,...,rhd) can be defined equivalently as the 
projection of rh onto "H(M): 

rh = argmin||rh-f||| I . (2.3) 

fe-H(M) 

By considering the Gateaux or Frcchct derivatives of the objective function with respect 
to f , the solution rh of the minimization problem (2.3) satisfies the following system of 
integral equations: 

0= / > M J (x) T [rh(x)-rh(x)]dx_ J , 1 < j < d, (2.4) 

where Mj are defined by M = (Mi, . . . ,M<i) T and x_j = (xi, . . . ,Xj-i,Xj+i, . . . ,Xd) T ■ 
In fact, the system (2.4) turns out to be a backfitting system of equations. To see this, 
we define 

n 

m J (x ] ) = q J (x j y 1 n- 1 Y / K hj (x ] ,X*)Z*Y\ 
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n 

q j (x j ) = n- l Y,Kh j (x j ,X})(Zl) 2 , 

i=\ 
n 

qj k (x j ,x k )=n- 1 Y^Kh j {x j ,X])K hk {x k ,X i k )Z i j Z i k , k^j. 

We note that, by definition, rhj : R — > R does not equal the jth component of m, which 
maps R d to R. We can then see that 



Mj(x) m(x)dx_j =rhj(xj)qj(xj), 
/ M J -(x) T rh(x)dx_ J = rhj(xj)qj(xj) + ^ rh k (x k )q jk (x : j,x k )d 



x k . 

k=l,7tj J 

The second formula is obtained by using the following property of the boundary corrected 
kernel: J Kh (uj,Vj)duj = 1. Thus, the system of equations (2.4) is equivalent to 

rhj(xj)=rhj(xj)- ^ / rh k {x k ) jfc „ J ' k dx k , l<j<d. (2.5) 

We emphasize that the proposed method does not require computation of the full- 
dimensional estimator m(x) at (2.2). It only requires one- and two-dimensional smooth- 
ing to compute fhj, q~j and qj k , and involves inversion of cjj only. In contrast, the marginal 
integration method studied by Yang et al. [24] involves the computation of rh(x), 
which requires inversion of the full-dimensional smoothing matrix M. Thus, in prac- 
tice, the marginal integration method may break down in high dimensions where d is 
large. 

We express the updating equations (2.5) in terms of projections onto suitable function 
spaces. This representation is particularly useful in our theoretical development. We 
consider Hj(M), 1 < j < d, subspaccs of TL(M.) defined by 

Hj (M) = {f € L 2 (M) : fj (x) = g 3 (xj ) for a function g, : R -)■ R, f k = for k # j}. 

With this definition, we have 

H(M)=Hi(M) + --- + H d (M). 

Also, denoting the projection operator onto a closed subspace H of H(M) by II(-|H) and 
its jth element by H(-\H)j, we get, for f £ L%QS/l), 

n(f|^(M)) i = g i ( ; r J )- 1 /M i (x) T f(x)dx_ j , 

(2-6) 

n(f|^(M)) fc = o, kjtj. 
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In particular, for f £ H(M), we have 



n(f|^.(M)), =/,(*,)+ £ f fk{x k f j \ {X , j ^ k) dx k . (2.7) 

k=i,^j J q ^ x i> 

Furthermore, for f £ Wfe(M), 

U(i\H 3 (M)) 3 = f f k (x k ) ^' Xk) dx k , j^k. (2.8) 

J qj (Xj ) 

For m £ tH(M.) , let m, (x) = (0, . . . , 0, rhj (xj), 0, . . . , 0) T denote the vector whose jth 

entry equals rhj(xj), the rest being zero. We can then decompose m as m = mH hihd- 

From (2.5) and (2.8), we obtain 

m J =n(ih- E rhfc-H^M)), l<j<d. (2.9) 

The backfitting equations (2.5), or their equivalent forms (2.9), give the following back- 
fitting algorithm. 

Backfitting algorithm. With a set of initial estimates rh^ , iterate for r = 1, 2, . . . the 
following process: for 1 < j ' < d, 

-Mr \ ~ r \ V^ f -Mr \Qjk{xj,x k ) , 

k=l J ( l3\ J '3J 

d 



il r-i} f _^ qj k (xj,x k ) 

k =3+i J 
or, equivalently 



E M^o^f^ 

k =j + l J 9 ^ X ^ 



m^nfm-XX 1 - E At 1] \Hi(jSX)\ (2.10) 

V fc=i fc=j+i / 

3. Theoretical properties of the local constant method 

3.1. Convergence of the backfitting algorithm 

The theoretical development for the backfitting algorithm (2.10) and for the solution of 
the backfitting equation (2.9) does not fit into the framework of an additive regression 
function as in Mammen et al. [17]. Formally, we get their model by taking Zj = 1 for 
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all 1 < j < d in (1.1). However, for identifiability of rrij, we need the assumption that 
£'(ZZ T |X = x) is invertible; see the assumption (Al) below. Trivially, this assumption 
does not hold for the additive model with Zj = 1. For our model, we directly derive the 
theoretical properties of the algorithm and the estimators by borrowing some relevant 
theory on projection operators. 

Let IIj denote the projection operator II(-|Hj(M)) and IF, the projection operator 
II(-|Hj(M)). Define Qj = I — IF, and Qj = I — II,; these are the projection operators 
onto Wj(M)) and Hj(M.)) , respectively. From the backfitting algorithm (2.10), it 
follows that 



(3-1) 




0-1 d > 

[r-1] 



\ fe=i k=j 



A- 



Define Q = Qd • ■ ■ Qi ■ Repeated application of (3.1) for j = d, d — 1, . . . , 1 gives 

m-m'^Qtm-m^ 1 '). 
This establishes that 



that 



'l = Qm[ r - 1 ]+f = 


s=Q 


(3.2) 


vrite nij(x) = (0, 


. . .,0,rhj(xj),0,. . -,0) T , then fijih = 


= ihj so 


(n d + Q d n d _ 1 + . 


\-Qd--'Q2*h-!. 


(3.3) 



Convergence of the backfitting algorithm (2.10) depends on the statistical properties 
of the operator Q. Consider the event £ n , where f, ml ! g %(M) and the norm of the 
operator Q is strictly less than one, that is, ||Q|| < 1. Here and below, for an operator 

\\F\\ = sup{||Ff || M : f £ H(M), ||f || M < !}■ 

Then, in that event, ^^LqQ 3 ^ is well defined in 'H(M) and, by (3.2), m.M converges to 
J2^LoQ S ^ as r tends to infinity. The limit is a solution of the backfitting equation (2.9) 
since the latter is equivalent tom = Qm + f . Furthermore, the solution is unique since 
repeated application of rh = Qrh + f leads to rh = J27Lo Q s *- 

Below, we collect the assumptions that make the event £ n occur with probability 
tending to one and state a theorem for the convergence of the backfitting algorithm (2.10). 
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Assumptions. 

(Al) i?(ZZ T |X = x) is continuous and its smallest eigenvalue is bounded away from 

zero on [0, l] d . 
(A2) sup xe[01]d E(Zf\X = x) < oo for all l<j<d. 

(A3) The joint density p ofX is bounded away from zero and is continuous on [0,l] d . 

(A4) E\Y\ a < oo for some a > 5/2. 

(A5) K is a bounded and symmetric probability density function supported on 

[—1)1] and is Lipschitz continuous. The bandwidths hj converge to zero and 

nhj/(logn) — >• oo as n— » oo. 

The assumption (Al) implies the concurvity condition (2.1) since it implies that there 
exists a constant c> such that for f <G H(M), 



Im>c 



d - 

5£//i(*i) 2 Pi(sj)dsj> (3-4) 



where pj denotes the marginal density function of Xj. The inequality (3.4) also tells us 
that the convergence of m in %(M) implies the convergence of each component rrij in 
the usual Li norm. 

Theorem 1. Assume that (A1)-(A5) hold. Then, with probability tending to one, there 
exists a solution {rhj}^ =1 of the backfitting equation (2.5) or (2.9) that is unique. Further- 
more, there exist constants < 7 < 1 and < C < 00 such that, with probability tending 
to one, 

dr. dp 

Yl / t^i 1 fo ) ~ ™3 ( X j )fP] ( X j ) dx j ^ C l 2r Yl / [™j ( X j) 2 + ^ ( X j) 2 ]Pj ( X j ) dx 3 ■ 
3=1 3=1 

3.2. Asymptotic distribution of the backfitting estimators 

Next, we present the asymptotic distributions of rhj. Define 

n 

m A (x) = M(x)- 1 ™- 1 J2 Zt i Yt - ™(X\ Z 4 )] A" h (x, X*), 

i=\ 

where m(X, Z) is as given in (1.1), and let m B = rh — m A . As in the proof of Theorem 1, 
we can prove that, for s = A or B, there exists a unique solution rh s £ H(M) of the 
corresponding backfitting equation (2.9) where rh is replaced by hV. By the uniqueness 
of rh, it follows that rh = rh" 4 + rh B . 

Put m A = (m A , . . . ,rh A ) T E "H(M). In the proof of the following theorem, we will show 
that rh A are well approximated by rh A = (tljih A )j. Note that 

n 
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Assume that the bandwidths hj are asymptotic to Cjn^ 1 ' 5 for some constants < Cj < oo. 
By the standard techniques of kernel smoothing, it can be proven that, for x in (0, l) d , 
(rh^-(xj), . . . 7 rhj(xd)) T , and thus m A , is asymptotically normal with mean zero and 
variance n~ 4 ' 5 di3i.g(vj(xj)), where 

E[zy(X,Z)\X^ Xj] r 

and c 2 (X, Z) = var(F|X, Z). Here, it is worth noting that the vector rh A , which belongs 
to L 2 (M), docs not equal (mf-(xj), . . . ,m^(x d )) T £H(M). 

The bias of the estimator m conies from m B , which is the projection of ih B = 
(mf,...,mf) T ontoH(M). Define r/(x) = (cfmi'(xi), ... ,(%m'J l (xd)) T and /3 (x) by 



A)W 



^c^( 2 ; fc Mx)- 1 J B(ZZ T |X = x)- 1 A (jB( zz fc |X = xMx)) + ^(x) 
.fc=i Xfc 

x / u 2 K{u) du. 



Note that m and /3 do not belong to "H(M). In the proof of the next theorem, we will 
show that /3o( x ) i s the asymptotic bias of m(x) as an estimator of m(x) and that the 
asymptotic bias of m(x) equals /3(x), where /3 is the projection of f3 onto 'H(M): 

/3 = n(/3 |H(M)) = argmin A/3 (x) - f(x)] T M(x)[/3 (x) - f(x)] dx. 
few(M) J 

We write /3(x) = (ft (an), . . . ,f3 d (x d )) T . 

The following theorem, which demonstrates the asymptotic joint distribution of rhj, 
requires an additional condition on rrij . 

(A6) £(ZZ t ct 2 (X, Z)|X = x) is continuous on [0, l] d . 

(A7) The coefficient functions m,j are twice continuously differcntiablc on [0,1], and 

E(ZjZk\~K = x) is continuously partially differentiable on [0, l] d for all 1 < j, k < 

d. 

Theorem 2. Assume that (A1)-(A7) hold and that the bandwidths hj are asymptotic 
to Cjn^ 1 ' 5 for some constants < Cj < oo. Then, for any x <S (0, l) d , n 2 ' 5 [rhj(xj) — 
m j( x j)] f or 1 — J ^ ^ are jointly asymptotically normal with mean (/?i(xi), . . . ,(3d{xd)) T 
and variance di&g(v j(xj)). 

4. The method with local polynomial fitting 

The method we studied in the previous two sections is based on local constant fitting, 
where we approximate fj(X*) by fj(xj) when X* are near Xj, in the least-squares 

criterion ^2™ =1 [Y l — ^2 i=1 fj(Xj)Zj] 2 . The method may be extended to local polyno- 
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mial fitting, where we approximate fj(X*) by fj(xj) + (Xj — x j)fj \ x j) + "' + (Xj — 

XjY /■ {xj)/it\ for X 1 , near Xj. Here and below, g( fc ' denotes the fcth derivative of a func- 
tion g : R -> R. Define 



w i (a;j,u i )= !. 



hj J \ hj 



We consider the following kernel-weighted least-squares criterion to estimate m,j 



/n 
i=l 



J'=l 



X h (x,X 4 )dx, (4.1) 



where f T = (fj 7 , . . . ,fj) and fj(xj) = (fj,o(%j), ■ ■ ■ ,fj,n(xj)) T for functions fj^ : R — >• M. 
Let rh be the minimizer of £(f). The proposed estimators of rrij are then m^o in m, 

and rhj.k in m are estimators of h h mh /k\. We thus define the proposed estimators 

of mS by 

rh, (xj) = k\h~ rhj^kixj), < k < ir, l<j<d. 
The minimization of L(i) at (4.1) is done over f with L(f) < oo. Define 

v(u,z;x) T = (wi(a;i,iii) T zi,...,w d (x d ,w d ) T z d ). 

The expression in the bracket at (4.1) can then be written as Y l — v(X l , Z l ;x) T f(x). We 
now redefine M used in the previous two sections as 

n 

M(x)=n- 1 ^v(X ? ,Z l ;x)v(X' i ,Z l ;x) T A' h (x,X I ). (4.2) 

»=i 

L(f ) < oo is then equivalent to J f (x) T M(x)f (x) dx < oo and minimizing L(f ) is equiv- 
alent to minimizing J[m(x) — f (x)] T M(x)[m(x) — f (x)] dx, where m(x) is redefined as 

n 

m(x)=M(x)- 1 n- 1 ^v(X l ,Z 4 ;x)r t A' h (x,X l ). (4.3) 

4=1 

The function space that arises in this general problem is the class of (ir + l)d-vectors of 
functions f = (fj,k) such that /f(x) T M(x)f(x)dx < oo and /j,fc(x) = gj,k{%j) for some 
functions gj t k ■ R — >• R, 1 < j < d and < k < n. We continue to denote the function space 
by 'H(M), and its norm by || • ||jq,. Thus, 

m = argmin 1 1 m — f 1 1 ^ . (4.4) 

fe-H(M) 

By considering the Gateaux or Frechet derivatives of the objective function L(f) with 
respect to f , the solution m of the minimization problem (4.4) satisfies the following 
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system of integral equations: 

0= /M J (x) T [rh(x)-m(x)]dx_ J , l<j<d, (4.5) 

where is the (n + l)-dimcnsional zero vector and Mj are (tv + l)d x (n + 1) matrices 
defined by M = M T = (Mi,. . .,M d ). We write rh T = (ihj , . . -,mj). Define 

n 
n 

9 j (x s )=rr 1 Y t ^j(x j ,XJ)w j (x j ,X*) T K hi {x j ,XJ)(Z i j ) !i , 

i=l 
n 

4f jk (x J ,x k ) = n- 1 Y / ^^^})w k (x k ,Xl) T K h] (x J ,Xi)K hk (x kl Xl)Z^Zl 
i=i 

for k ^ j. We then find that the system of (n + l)-dimcnsional equations (4.5) is equiva- 
lent to the following backfitting equations which update the estimators of rrij and their 
derivatives up to the 7rth order: 

d . 

mj(xj)=mj(xj)- ^2 / ^j{ x o)~ 1 ^ ! ok{xj,Xk)ra.k{xk)^Xk, l<j<d. (4.6) 

We want to emphasize again that the method with local polynomial fitting does not 
require computation of the full-dimensional estimator rii(x) at (4.3). It only requires one- 
and two-dimensional smoothing to compute ihj, \&j and ^fjk, and involves inversion 
of if! j only. Although <&j are (ir + 1) x (tt + 1) matrices, they are computed by means 
of one-dimensional local smoothing so that they do not suffer from sparsity of data in 
high dimensions. The marginal integration method, in contrast, requires the computation 
of m(x) and thus, in practice, the marginal integration method may break down in the 
case where d is large. 

Backfitting algorithm. With a set of initial estimates rh' = (rhj t o, . . . ,fhj !W ) , we 
iterate for r — 1,2,... the following process: for 1 < j < d, 

j'-l 



k=l J 
d p 

Y, / *i(a:i)~ 1 *ifc(a;j,Sfc)iiiL r ~ 11 (a;fc)d« 



In the following two theorems, we show that the backfitting algorithm (4.7) converges 
to txij, 1 < j < d, at a geometric rate and that rhj, 1 < j < d, are jointly asymptotically 
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normal. We give the results for the case where n, the order of local polynomial fitting, is 
odd. It is widely accepted that fitting odd orders of local polynomial is better than even 
orders. It also gives simpler formulas in the asymptotic expansion and requires a weaker 
smoothness condition on _E(ZZ T |X = x). In fact, instead of (A6) in Section 3, we need 
the following assumption: 

(A7') The coefficient functions mj arc (jt + l)-times continuously diffcrcntiable on 
[0, 1] and E(ZjZ k \X = x) is continuous on [0, l] rf for all 1 < j, k<d. 

To state the first theorem, we need to introduce the limit of the matrix M(x). Note 
that M(x) consists of (n + 1) x (n + 1) blocks 

n 

M ijfc (x) = n- 1 J2 w,(x„X;)w fe ( a ; fe ,^) T Z;^A h (x,X 4 ), l<j,k<d. 

4=1 

For j =£ k, the matrices rVLj-^x) are approximated by 

E[w j (x j ,X j )w k (x k ,X k ) T Z}ZlK h ( x ,X)]~ W T E(Z 3 Z k \X = x)p(x), 

where /x = ([ig(K)) and fie(K) = J u l K(u) du. On the other hand, for j = fc, 

M j , j (x)~N 1 E(Z?\X = x)p(x), 

where Ni is a (tt + 1) x (n + 1) matrix defined by Ni = (fie + e>(K)). Here, we adopt the 
convention that the indices of the matrix entries run from (0, 0) to (it, n). Thus, M(x) is 
approximated by 

M(x) =p(x)[£(ZZ T |X =x) <8> (/x/x T ) + di&g(E(Z]\X = x)) ® (Ni - /*/i T )], (4.8) 

where (8 denotes the Kronecker product. The matrix M(x) is positive definite under the 
assumption (Al). To see this, note first that by (Al), the matrix E(ZZ T |X = x) ® (/^M T ) 
is nonnegative definite. Also, E(Z^\X = x) are bounded away from zero on [0, l] d for all 
1 < J < d. Furthermore, Ni — fifi T is the variance-covariance matrix of (1, U,... , U^) T , 
where U is a random variable with density K. Since K is supported on a uncountable 
set, it follows that Ni — fifi T is positive definite. The foregoing arguments show that the 
smallest eigenvalue of M(x) is bounded away from zero on [0, l] d . Let TL(M.) be defined as 
H(M) with M being replaced by M and define its norm by ||f |||^ = / f (x) T M(x)f (x) dx. 

Theorem 3. Assume that (A1)~(A5) hold. Then, with probability tending to one, there 
exists a solution {rhj} d =1 of the backfitting equation (4-6) that is unique. Furthermore, 
there exist constants < 7 < 1 and < C < 00 such that, with probability tending to one, 

d p 

J2 / I ™i ] fa ) _ m J fa ) 1 2 PJ fa ) dx i 

< ci 2r j2 h\^fa)\ 2 + im; oi (^-)i 2 fe(^)d^. 



i=i 
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In the next theorem, we give the asymptotic distribution of the proposed estimators. 
We define m(x) T = (mi(ii) T , . . . ,md(xd) T ), where 

m 3 { Xj ) = (m j (x j ),h j mf\x j )/ll,...,h?m < f ) (x j )/Trl) T . (4.9) 

For the bandwidths hj , we assume that hj is asymptotic to Cjn~ 1 ^ 2 '" +3 ^ for some constant 
< Cj < oo. Define 7 = (fj,^+i(K), . . . , ^r+i+^lif)) and a (it + 1) x (it + 1) matrix by 

N 2 = (ne+e>(K 2 )). For l<j<d, define /J.fo) = c^N^-ym^ (xj) / (tt + 1)! and 



£[Z?<7 2 (X,Z)|X j = 






^^^ cj P j{xj)[E{Z 2 \Xj=xj)}^ ^ • 

Theorem 4. Assume iftai (A1)-(A6) and (A7') /io/d ; and i/iai the bandwidths hj are 
asymptotic to Cjn~ 1 " 2n+3 > for some constants < Cj < 00. Then, for any x G (0, l) d , 
n (7T+i)/(27r+3) x [jiij(xj) — m.j(xj)], l<j<d, are asymptotically independent and 

nOH-D/^+^fo) - mj ( Xi )] 4 NiPjix&Vjixi)), 1 < j < d. 

Theorem 4 not only gives the asymptotic distributions of the estimators of the co- 
efficient functions rrij, but also those of their derivatives. Recall the definition of nx,- 
at (4.9) and that ihj(xj) = (rhj(xj),hj7hj '(xj)/V.,. . .,h^fh^(xj)/ir\) T . Thus, the the- 
orem implies that n^ +1 ~ k '/( 2 * +3 '[fhS \Xj) — mS (xj)] is asymptotically normal with 
mean kk T ' +1 ~ k (N^ 1 f) k x m {lT+1 \xj)/{iT + 1)! and variance 



(fc!) 2 (N 1 - 1 N 2 Nr 1 ) 



E[Z 2 o 2 (^Z)\X 3 =x 3 ] 
kk C f +l p 3 (xj)[E(Z 2 \X 3 =x 3 )} 2 ' 



where, for a vector a and a matrix B, a^ denotes the fcth entry of a and B^^ denotes 
the fcth diagonal entry of B. In the case of local linear fitting (jr = 1), we have 

(N- 1 N 2 Nr 1 )oo = / K 2 (u) d«, (N- 1 7 )o = / u 2 K(u) d«. 

Another implication of Theorem 4 is that the estimators rhS (xj) for < k < -k have 
the oracle properties. Suppose that we know all other coefficient functions except rrij. In 
this case, we would estimate m,j and its derivatives up to order -k by minimizing 



"E 



d 

Y*- Y, m k (XDZl-Z>w 3 (x 3 ,X}) T { 3 (x 3 ) 



-, 2 

K h3 (x„X]) 



over fj. It can be shown that the resulting estimators of mS (xa) for < k < n have the 
same asymptotic distributions as rhS (x ) for < fc < n. 
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The asymptotically optimal choices of the bandwidths hj may be derived from The- 
orem 4. Let c? bj(xj) and c~ Tj(xj) denote the asymptotic mean and the asymptotic 
variance of n^ +1 ^^ + ^[rhj{xj) — rrij(xj)], respectively. That is, 

b j (x j ) = (N^) m^ +1 \x j )/(n + l)\, 

The optimal choice of Cj which minimizes the asymptotic mean integrated squared error 
is then given by 



opt 

C J = 



2(Tr + l)fb J (x j ) 2 p j (x j )dx j 



(4.10) 



This formula for the optimal bandwidth involves unknown quantities. We may get a rule- 
of-thumb bandwidth selector by fitting polynomial regression models, as in Yang et 
al. [24], to estimate the unknown quantities in the formula for c° p ; see Section 6, where we 
employ this approach to analyze Boston Housing Data. Alternatively, we may adopt the 
approach of Mammen and Park [18] to obtain more sophisticated bandwidth selectors. 

5. Numerical properties 

We investigated the finite-sample properties of the proposed estimators in comparison 
with the marginal integration method studied in Yang et al. [24] . We considered the case 
where local linear smoothing (tt = 1) is employed. The simulation study was done in two 
settings, one in a low-dimensional case (d = 3) and the other in a high-dimensional case 
(rf=10). 

In the first case, we generated the data (X 1 , Z l , Y l ) from the model: Y = mi(Xi)Zi + 
m 2 (X 2 )Z 2 + m 3 (X 3 )Z 3 + a(X, Z)e, where Z\ = 1 and 



1 z\ + z 2 ___ ( n i x\ + x 2 

2 1 + zk + Z: 



-(x,z) = ^ + T ^^3exp(-2 + ±i^). (5.1) 



The vector X = (X\,X 2l Xz) was generated from the uniform distribution over the unit 
cube (0, l) 3 , and the covariate vector (Z 2 , Z 3 ) was generated from the bivariate normal 
with mean (0, 0) and covariance matrix (q 1 . ° 1 5 ) . The vectors X and Z were independent, 
and the error term e was generated from the standard normal distribution, independently 
of (X, Z). This model was also considered in Yang et al. [24]. We took mi(x) = 1 + c 2x ~ 1 , 
m 2 (x) — cos(27ta;) and 7713(0;) = x 2 . 

In the second case, where d= 10, we took the same variance function <r 2 (x, z) as 
in (5.1), for the sake of simplicity. Thus, a 2 (x, z) did not depend on (xj , zf) for 4 < j < 10. 
The extra covariates Xj for 4 < j < 10 were generated from the uniform distribution 
over (0, l) 7 independently of (Xi,X 2 ,X 3 ), and Zj for 4 < j < 10 were generated from 
the multivariate normal distribution with mean and covariance matrix I, the identity 
matrix, independently of (Z 2 , Z 3 ) and of X. We chose rrij(x) = x 2 for 4 < j < 10. 
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We used the Epanechnikov kernel K(u) = (3/4)(l — u 2 )I[— 1, l](u) and the optimal 
bandwidths h s ^ f = c opt n -1 / 5 , where c° p are given at (4.10). This was for the proposed 
estimator. For the marginal integration method, the estimator rtif 1 of the jth coefficient 
function mj that we investigated was 

n 



! 



( Xj ) = n- l ^e j (x{,...,x;_ 1 ,x j ,x] +l ,...,x?), 



i=\ 



where 6j(x) was the [(j — l)(ir + 1) + l]st entry of m(x) defined at (4.3), but Kh k 
(for k j£ j) in the definition of m(x) was replaced by Lt, k . Note that, for the marginal 
integration method, in the estimation of the jth coefficient function, we may choose 
another kernel L and need to use other bandwidths bk, different from hj , for the directions 
of Xk{k 7^ j) not of interest. We took L = K and bk = c(logn) _1 /i" n for all directions 
k =/= j, where /i" 11 is the optimal bandwidth for the marginal integration method, obtained 
similarly as the one for the proposed method at (4.10), and c was a constant multiplier 
for which we tried four values, c= 1, 3, 5, 10. 

We used rhj defined in Section 4 as the initial estimates rh'- ' for the proposed method. 
The backfitting algorithm converged very fast. We took 



^y^-^o-m? 1 ^)] 2 ^ ^ i(ri1 



as a criterion for the convergence. With this criterion, the backfitting algorithm converged 
within 11 iterations in all cases. The average number of iterations was 6.5 from the 500 
replications. In a preliminary numerical study with the marginal integration method, we 
found that inverting the matrix M(x) often caused numerical instability of the estimates, 
even for the low-dimensional case where d = 3. This reflects the curse of dimensionality 
that the marginal integration suffers from. Thus, we actually computed a 'ridged' version 
of m" 11 by adding n~ 2 to the diagonal entries of the matrix M(x). The same modification 
was also made in the numerical study of Yang et al. [24] . 

Table 1 shows the results for the case d = 3, based on 500 data sets with sizes n = 100 
and 400. The table provides the mean integrated squared errors (MISE) of the estimators 
of each coefficient function rrij , defined by 

MISEj (fhj) = / E[fhj(xj) — rrij(xj)] dxj 

[Efhj (xj ) — rrij (xj )] 2 dxj + / [fhj (xj ) — Efhj (xj )] 2 dxj 

^ISBj-foO+IVjCm;) 

for an estimator fhj. It also gives the integrated squared bias (ISB) and the integrated 
variance (IV). The results suggest that the proposed method gives better performance in 
terms of MISE tot = Y^=\ MISEj . When n = 100, the sum of MISE^ of fhj equals 0.7621, 
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Table 1. The mean integrated squared errors (MISE), the integrated squared biases (ISB) and 
the integrated variances (IV) of the marginal integration estimators (MI) and the proposed 
estimators (SBF) when d = 3 (the constant c for MI is the multiplier c in the formula bk = 
c(logn) /i™ 1 , where bk is the secondary bandwidth applied to the direction of Xk, k^j, in the 
estimation of m,) 



Sample 


Coefficient 








MI 




SBF 


size 


function 




c=l 


c = 3 


c = 5 


c=10 




n=100 


in i 


MISE 


0.1190 


0.1140 


0.1158 


0.1151 


0.1496 






ISB 


0.0174 


0.0150 


0.0147 


0.0145 


0.0019 






IV 


0.1016 


0.0990 


0.1011 


0.1006 


0.1476 




mi 


MISE 


0.6354 


0.5738 


0.5795 


0.5826 


0.3613 






ISB 


0.4089 


0.3502 


0.3465 


0.3461 


0.0484 






IV 


0.2265 


0.2236 


0.2330 


0.2364 


0.3129 




111.; 


MISE 


0.1873 


0.2218 


0.2255 


0.2259 


0.2512 






ISB 


0.0057 


0.0056 


0.0056 


0.0056 


0.0017 






IV 


0.1816 


0.2163 


0.2200 


0.2203 


0.2495 


n = 400 


m\ 


MISE 


0.0347 


0.0332 


0.0365 


0.0363 


0.0415 






ISB 


0.0092 


0.0087 


0.0087 


0.0086 


0.0005 






IV 


0.0255 


0.0245 


0.0279 


0.0277 


0.0410 




in-2 


MISE 


0.2648 


0.2815 


0.2872 


0.2894 


0.1244 






ISB 


0.2126 


0.2227 


0.2248 


0.2257 


0.0199 






IV 


0.0521 


0.0588 


0.0624 


0.0637 


0.1045 




111:', 


MISE 


0.0478 


0.0576 


0.0610 


0.0620 


0.0810 






ISB 


0.0050 


0.0046 


0.0046 


0.0047 


0.0008 






IV 


0.0428 


0.0529 


0.0564 


0.0573 


0.0802 



while those of the marginal integration method are 0.9417,0.9096,0.9208,0.9236 for c = 
1,3,5,10, respectively. In the case where n = 400, MISE to t = 0.2469 for the proposed 
method, while it equals 0.3473, 0.3723, 0.3847, 0.3877 for the marginal integration method. 

According to Table 1, the performance of the marginal integration method appears 
not to be sensitive to the choice of the secondary bandwidth bk- However, this is true 
only when we use the optimal bandwidth /i lnl . In fact, we found that the performance 
depended crucially on the choice bk when other choices of hj were used. As an example, 
we report in Table 2 the results when one uses hj = h^/3 instead of hj = hf 1 . In the 
latter case, the sum of MISEj ranges from 0.8001 to 2.7453 when n — 100, and from 
0.2291 to 2.1080 when n = 400, for those four values of c. One interesting thing to note 
is that the ISB of the marginal integration increases drastically as c decreases. The main 
lesson here is that the choice of the secondary bandwidths bk for the marginal integration 
method is as important as the choice of hj . 

The finite-sample results in Table 1 show some discrepancy with the asymptotics for 
the functions mi and 7713. Asymptotically, if the optimal bandwidth is used, then the IV 
is four times as large as the ISB. In general, finite-sample properties do not always match 
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Table 2. The mean integrated squared errors (MISE), the integrated squared biases (ISB) and 
the integrated variances (IV) of MI when hj = /i™'/3 was used (the constant c is the multiplier c 
in the formula b^ — c(logn) _1 /i™', where bk is the secondary bandwidth applied to the direction 
of Xk, k 7^ j, in the estimation of rrij) 



Sample 


Coefficient 






MI 




size 


function 




c = l 


c = 3 


c = 5 


c=10 


n=100 


in i 


MISE 


1.5109 


0.2664 


0.1822 


0.1737 






ISB 


1.4327 


0.0096 


0.0011 


0.0012 






IV 


0.0782 


0.2568 


0.1812 


0.1725 




mi 


MISE 


0.6611 


0.4578 


0.3459 


0.3576 






ISB 


0.3095 


0.0340 


0.0338 


0.0313 






IV 


0.3516 


0.4238 


0.3121 


0.3263 




111.; 


MISE 


0.5733 


0.2743 


0.2720 


0.3012 






ISB 


0.0217 


0.0013 


0.0014 


0.0015 






IV 


0.5516 


0.2730 


0.2706 


0.2997 


n = 400 


Ill i 


MISE 


1.4539 


0.0891 


0.0465 


0.0465 






ISB 


1.4177 


0.0032 


0.0004 


0.0003 






IV 


0.0362 


0.0859 


0.0461 


0.0462 




mi 


MISE 


0.3554 


0.1596 


0.1109 


0.1129 






ISB 


0.2359 


0.0139 


0.0154 


0.0160 






IV 


0.1195 


0.1457 


0.0955 


0.0969 




in.; 


MISE 


0.2987 


0.0702 


0.0717 


0.0856 






ISB 


0.0188 


0.0005 


0.0006 


0.0007 






IV 


0.2799 


0.0697 


0.0711 


0.0849 



with asymptotics. One possible reason for the discrepancy in this particular setting is 
that the coefficient functions m\ and m^ are far simpler than the complexity brought 
by the noise level, so the proposed method easily catches the structure with less bias. 
This seems not to be the case with the marginal integration, however. For the marginal 
integration, the secondary bandwidths bk interact with the primary bandwidth hj for 
the bias and variance performance, as discussed in the previous paragraph. 

Table 3 shows the results for the case d = 10. Here, for the marginal integration, we 
report only the results when c = 5 which gave the best performance. In fact, the marginal 
integration got worse very quickly as c decreased from c = 5. For example, we found the 
total MISE, J2)=i MISEj, was 3.4996 when c = 3 and was 6.1834 when c = 1, in the case 
where n = 400. Note that the value equals 0.6440 when c = 5 and n = 400, as reported 
in Table 3. For the proposed method, it equals 0.5136. 

6. Analysis of Boston Housing Data 

The data consist of fourteen variables, among which one is response and the other thirteen 
are predictors. There are 506 observations from 506 tracts in the Boston area; see Harrison 
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Table 3. The mean integrated squared errors (MISE), the integrated squared biases (ISB) and 
the integrated variances (IV) of the marginal integration estimators (MI) and the proposed 
estimators (SBF) when d = 10 



Sample 


Coefficient 
function 




MI 






SBF 




size 


MISE 


ISB 


IV 


MISE 


ISB 


IV 


n=100 


mi 


0.2533 


0.1242 


0.1291 


0.1904 


0.0046 


0.1858 




in 2 


0.7284 


0.4353 


0.2931 


0.4357 


0.0605 


0.3752 




m 3 


0.2622 


0.0059 


0.2563 


0.3042 


0.0024 


0.3018 




in 4 


0.1303 


0.0054 


0.1249 


0.1404 


0.0022 


0.1382 




?7l5 


0.1351 


0.0060 


0.1291 


0.1489 


0.0011 


0.1478 




me 


0.1336 


0.0055 


0.1281 


0.1509 


0.0019 


0.1490 




'"7 


0.1345 


0.0054 


0.1291 


0.1677 


0.0019 


0.1658 




ms 


0.1228 


0.0053 


0.1175 


0.1482 


0.0019 


0.1463 




iu<, 


0.1428 


0.0071 


0.1357 


0.1707 


0.0009 


0.1698 




mw 


0.1270 


0.0059 


0.1211 


0.1528 


0.0014 


0.1514 


n = 400 


nil 


0.0505 


0.0115 


0.0390 


0.0457 


0.0008 


0.0449 




1112 


0.2999 


0.2223 


0.0776 


0.1264 


0.0196 


0.1068 




III A 


0.0642 


0.0054 


0.0588 


0.0893 


0.0004 


0.0889 




<»4 


0.0324 


0.0048 


0.0276 


0.0379 


0.0005 


0.0374 




III-, 


0.0358 


0.0054 


0.0304 


0.0355 


0.0010 


0.0345 




me 


0.0369 


0.0040 


0.0329 


0.0331 


0.0004 


0.0327 




'"7 


0.0300 


0.0044 


0.0256 


0.0370 


0.0009 


0.0361 




ma 


0.0319 


0.0043 


0.0276 


0.0368 


0.0006 


0.0362 




mg 


0.0321 


0.0052 


0.0269 


0.0364 


0.0009 


0.0355 




mw 


0.0303 


0.0046 


0.0257 


0.0355 


0.0006 


0.0349 



and Rubinfeld [10] for details about the data set. The data set has been analyzed by 
Fan and Huang [7] and Wang and Yang [22], among others. The former fitted the data 
using a partially linear functional coefficient model where all coefficient functions in the 
nonparamctric part arc functions of a single variable. The latter considered an additive 
regression model. Here, we apply the varying coefficient model (1.1) to fit the data using 
the proposed method. We take the variable MEDV (median value of owner-occupied 
homes in $1000's) as the response variable Y. We consider five variables as covariates Xj 
or Zj . They are CRIM (per capita crime rate by town) , RM (average number of rooms 
per dwelling), TAX (full-value property tax rate per $10000), PTRATIO (pupil-teacher 
ratio by town) and LSTAT (percentage of lower income status of the population) . As in 
Wang and Yang [22], we take logarithmic transformation for TAX and LSTAT to remove 
sparse areas in the domains of these variables. 

We want to find a varying coefficient model that fits the data set well. Since LSTAT 
can be a good explanatory variable that determines the overall level of the housing price, 
wc consider models of the form 



MEDV = ?7ii(log(LSTAT)) + to 2 (X 2 )Z 2 + m 3 (X 3 )Z 3 + (noise). 



(6.1) 
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Table 4. Relative squared prediction errors obtained from fitting f 2 varying coefficient models 
with the Boston Housing Data 



Model no. 




Covariates 




Relative squared 




x 2 


z 2 


x 3 


z 3 


prediction error 


1 


CRIM 


RM 


TAX 


PTRATIO 


0.3514 


2 


CRIM 


RM 


PTRATIO 


TAX 


N/A 


3 


CRIM 


TAX 


RM 


PTRATIO 


0.2700 


4 


CRIM 


TAX 


PTRATIO 


RM 


0.2688 


5 


CRIM 


PTRATIO 


RM 


TAX 


0.4390 


6 


CRIM 


PTRATIO 


TAX 


RM 


0.4757 


7 


RM 


CRIM 


TAX 


PTRATIO 


0.3010 


8 


RM 


CRIM 


PTRATIO 


TAX 


0.2412 


9 


RM 


TAX 


PTRATIO 


CRIM 


N/A 


10 


RM 


PTRATIO 


TAX 


CRIM 


N/A 


fl 


TAX 


CRIM 


PTRATIO 


RM 


N/A 


12 


TAX 


RM 


PTRATIO 


CRIM 


N/A 



A general question is which variables should be the model covariates Zj and which should 
take the role of Xj. This may be obvious for some data sets, but it is not so clear for 
the Boston Housing Data. Thus, we fitted all possible models and chose the one that 
best fitted the data. In general, we do not suggest employing the all-possible-models 
approach since it can get out of control quickly as the number of variables increases, and 
it induces a certain arbitrariness in the choice. For the Boston Housing Data, there are 
only twelve varying coefficient models of the form (6.1), listed in Table 4, and all models 
are interpretable. If the number of variables is large, then we suggest first choosing a set 
of model covariates Zj among all covariates by fitting parametric linear models and using 
a variable selection technique, and then picking one as Xj for each Zj from the remaining 
variables based on a criterion such as RSPE (which is defined later) . 

We employed local linear smoothing in implementing the proposed method and used 
the Epanechnikov kernel. For the bandwidths hj , we chose to use a rule-of-thumb method 
that we describe below. Note that the unknowns in the expression of the optimal band- 
width at (4.10) are Aj = Jm' J !(x j ) 2 p j (x j )dx j , Bj(xj) = E[Z?a 2 (X,Z)\Xj = Xj] and 
Cj(xj) = E(Z 2 \Xj = Xj). The second derivative of rrij in Aj can be estimated by fit- 
ting a cubic polynomial regression model. This gives Aj = n^ 1 X)iLi(2ay,2 + Ga^Xj) 2 , 
where c\j } k are the least-squares estimators that minimize 



E 



1 2 



Y l 



^Ko + a hl X) + a ja Xf + a ]i3 Xf)Z} 



Here, we take Z[ = l. The conditional means, Bj and Cj, can be estimated by fitting 
linear regression models. Since E[Z 2 a 2 (X, Z)\Xj = Xj ] = E[Z 2 AY - m(X, Z)) 2 |X 



'./J- 
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the conditional mean Bj is estimated by Bj(xj) = /3jo + j3j t xXj, where (3jfi and j3j,\ 
minimize 



E 



2f [ y* _ ^(d fe , + a ktl Xt + a k , 2 Xl 2 + a M Xf)^ - &,„ - &,iXj 



fe=i 



Similarly, Cj for j = 2,3 are estimated by Cj(xj) — 7^0 + Tj.i^j) where 7^0 and 7^.1 
minimize X)"=i(^] 2 — Ti,o — 7j,i^]) 2 - Note that C*i = 1. 

We split the data set into two parts, one for estimation of the models and the other 
for assessment of the estimated models. We selected 100 tracts for the model assessment 
out of 506 distributed in 92 towns. This was done in a manner that would lead to more 
selections in a town with a larger number of tracts. We fitted the twelve varying coefficient 
models using the data for the remaining 406 tracts and made out-of-samplc predictions 
with the data for the selected 100 tracts. We calculated their relative squared prediction 
errors, 

RSpE = E!£°i[MEDV' -m 1 (log(LSTAT')) - m 2 (X* 2 )Z* 2 - m 3 (XJ)Ztf 

X)-°i [MEDW -MEDV] 2 

where rhj for j = 1,2,3 were constructed by using the data for the 406 remaining tracts. 

Table 4 reports the results. In the table, we do not provide the values of RSPE for the 
models numbered 2, 9, 10, 11 and 12. In the preliminary fitting of these models taking Xj 
and Zj as specified, we found that they produced extremely large residuals for some of 
the observations that corresponded to PTRATIO = 20.2 or TAX = 666. This resulted in 
a negative value of Bj{xj) for a certain range of Xj and, as a consequence, produced 
a negative estimate of J Tj(xj)pj(xj)dxj in the bandwidth formula (4.10). Since these 
five models do not explain MEDV well as a function of the covariates and would give 
a large value of RSPE when fitted, we excluded them from further analysis. 

According to the table, the model with the smallest RSPE is 

MEDV = mi(log(LSTAT)) + m 2 (RM) CRIM + m 3 (PTRATIO) log(TAX) + (noise). 

(6-2) 
Figure 1 depicts the estimated coefficient functions rh\ , m 2 and 7713 . It also plots the actual 
values of MEDV and their predicted values according to the estimated model from (6.2). 
The prediction was made for those 100 tracts that were not used in estimating the model. 
The estimated curve rh\ indicates that a high percentage of lower income status decreases 
the prices of homes. The estimated curve m 2 suggests that for towns with higher or lower 
average numbers of rooms per dwelling, the crime rate is less influential on the prices of 
homes. Finally, from the estimated curve 7713, we see that if the pupil-teacher ratio gets 
higher, then the prices of homes increase less rapidly as the property tax rate increases. 
The curve 7713 looks somewhat rigid. The reason for this is that the variable PTRATIO 
does not really take values on a continuous scale since it is the pupil-teacher ratio by town, 
so that all tracts in a town have the same value of PTRATIO. Furthermore, some towns 
share the same value with others. For example, the 132 tracts (out of 506) associated 
with the 15 towns in the city of Boston have the same value, 20.2. 
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2.0 2.5 3.0 

log(LSTAT) 





Figure 1. For the final model (6.2), the upper-left, upper-right and lower-left panels depict the 
estimated coefficient functions mi, rh,2 and ms, respectively, and the lower-right panel exhibits 
plots of the observed values Y % versus their predicted values Y l . 

Appendix: Technical details 
A.l. Proof of Theorem 1 

We prove that there exists a constant < 7 < 1 such that ||Q|| < 7 with probability 
tending to one. Let TLj(M.) be defined as Hj(M.) with M being replaced by M. Let Pj 
and pjk denote the marginal densities of Xj and (Xj,Xk), respectively. Define 



Qj( x j) = e ( z 1\ x j = x 3 )pj{xj), 



(A.l) 
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qjk{x 0l x k ) = E(ZjZ k \Xj = Xj,X k = x k )p jk (xj,x k ), kj^j. 
For i, en,(M), 

\\fj IIm = J f,(x) T M(x)f,(x) dx = J frixtfq^dxj. 

The equality (A. 3) follows from the identity 

E(Z]\X = x)p(x) dx_, = E{Z)\X, = Xjfafa). 

From (A. 3) and Holder's inequality, it follows that, for f <G H(M.), 
\\(Qj -Qi)t ||m 

q jk (xj,x k ) q jk (xj,Xk 
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(A.2) 

(A.3) 



E 



* E 




q 3 (x 3 )q k (x k ) qj(xj)qk(xk), 

n l/2 

fk(x k ) 2 q k (x k )dx k 



fk(x k )dx k qj(xj)dx 



qj(xj)q k (x k )dxjdx k 



1/2 



1/2 



<O p (l) J2 II^Hm- 
fc=l,#j 

Since ||Qj|| = 1, this proves that ||Qj|| < C\ with probability tending to one for some 
constant < C\ < oo. Define Q = Qd ■ • • Qi- Then, 



WQ-QW 



d-l 



2_^Qd- ■ -Qd-k+l{Qd-k — Qd-k)Qd-k-\ ' ' ' Q\ 



k=0 



O p (l), 



where we interpret both Qd+i and Qo as the zero operator. From (Al), (A3) and (A.3), 
the projection operators IT, : Hk(M) — > TLj(M.) for all 1 < j ^ k < d are Hilbcrt-Schmidt. 
By applying parts B, C and D of Proposition A. 4. 2 of Bickel, Klaassen, Ritov and Wcll- 
ner [1], we find that ||Q|| < 1. This shows that there exists a constant < 7 < 1 such that 
IIQH < 7 with probability tending to one. 

To complete the proof of Theorem 1, it follows from (3.2) that with probability tending 
to one, 



m 1 ' — m 



M 



^Q s f + Q r m 



[o] 



<7 r ||*|| 






M 



1 



M 
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By (3.3) and the fact that 1 1 j 1 1 < Ci with probability tending to one, there exists a 
constant < C 2 < 00 such that with probability tending to one, 



|f||M< 



d r „ 



1/2 



This completes the proof of Theorem 1. 



A. 2. Proof of Theorem 2 



We will prove that for each x £ (0, l) d , 



mf{xj) = rhf(xj) + o p {n~ 2/5 ) for 1 < j < d, 



rii^fx) = m(x) + (3(x)n- 2/0 + o p {n~ 2/b ). 

Proof of (A.4). Note that m A = J2T=o Q S * A , wher e 

r A = (I - Q)m A = m A + Qd&j^ + ••• + &*••■ Q 2 mf 



(A.4) 
(A.5) 



(A.6) 



and mf(x) = (0,..., 0,mf(xj), 0,...,0) T . From formulas (2.6)-(2.8), it follows that 



x,A, 



Qd'-Qj+irhj (x) = (0, . . . , 0, rh j (xj),g j+ i(x j+ i), . . . ,g d (x d )) 



2 < j < d, 



for some random functions gt '■ R — > R, ] ' + 1 < k < d, where the first j — 1 entries of the 
vector on the right-hand side of the equation are zero. This implies that 



f A (x) = (mf(xi),g 2 (x 2 ), . . .,g d (x d )) T , 
where gu for 2 < k <d are random functions from R to R. If we prove that 



sup 

ce[o,i] d 



^Q^f^(x) 



s=l 



= o p (n- 2 / 5 ) 1 



(A.7) 



(A. 



then (A.7) implies (A.4) for the case j = 1. By exchanging the entries of rh A , we can see 
that (A.4) also holds for j > 2. 

To prove (A. 8), it suffices to show that 



sup |Qf A (x)|=o p (n- 2 / 5 ), 

cG[0,l] d 



E< 



■o p (n- 2 /% 



(A.9) 
(A.10) 






To sec this, note that from (2.6) and (2.7), we have, for f = (f t , . . . , f d ) T e H(M), 



}j f ( X ) = (/lfcl), • ■ • J fj-l(Xj-l), fj(Xj), f j+ l(x j+1 ),. . . , f d (x d )) 



(All) 
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*<1 



25 



where /* (xj ) = — J2t=i =t j J fk ( x k ) 9j ^ (xA dxfc . Thus, there exists a constant < C < oo 
such that with probability tending to one, 



sup 

xe[o,i] d 



£q s ^(x) 





oo 




OC 


sup 


Q]TQ s fV) 


<c 


£q s ^ 


G[0,l] d 


3 = 1 




s=l 



M 



We prove (A. 9) and (A. 10). From standard kernel theory, we can prove that for all ky^j, 



sup 

rc fc e[0,l] 



~ At \ Qjk{Xj,Xkj , 

mf (xj) — , , — dec, 
J gfc(xfc) 



: o p (n- 2 / 5 ). 



(A.12) 



The approximation (A.12), together with the expressions at (A. 6) and (A. 11), gives (A. 9). 
Since ||Q|| < 7 with probability tending to one for some < 7 < 1, we have 



J2Q S 



<£ 7 S |IQf A ||M=o p (n- 2 / 5 ). 

M s=2 



This completes the proof of (A. 4). 



□ 



Proof of (A. 5). Let li(x, u) = ((m — X\)m' x {x\), . . ., (ud — Xd)m' d (xd)) T and l 2 (x, u) = 
{{u\ — xi) 2 m"(x 1 )/2 7 . . ., (ud — Xd) 2 m' d '(xd)/'2,) 1 ■ To get an idea of which terms in an 
expansion of m s (x) lead to the main terms in the expansion (A. 5), we note from an 
expansion of ?n(X l ) that rh s (x) is approximated by 



m(x) + Mfx) -1 ^- 1 £ Z i Z* T li (x, X 4 )if h (x, X 1 ) 
»=i 

n 

+ M(x)- 1 r l - 1 £z 4 Z iT l 2 (x,X l )K h (x,X 4 ). 



(A.13) 



Define m B ' 1 (x) = M(x)" 1 / M(x)li(x, u)# h (x, u) du. The second term of (A.13) is then 
approximated by m s ' 1 (x) + M(x) _1 ^2 k=1 [dM-k(x)/dxk\h 2 l rn' k {xk) § u 2 K (u) du. Also, 
the third term is approximated by {h\m'{{xi)/2, . . . ,h 2 l m' d '(xd)/2) T j u 2 K(u)du. Define 



M^r 1 J2^ M ^)hlmUx k ) + l(h 2 1 mUxi),...,h 2 m^x c 
fe=i 



d Xk -^-/-«-«v-»/ ■ 2 1 



x / u 2 K(u)du 



m^(x) 



and let rh B ' 3 (x) = rh B (x) - m(x) - m s,1 (x) - rh S ' 2 (x). 

For £= 1,2,3, define m B ' e to be the solution of the backfitting equation at (2.9) 
with m being replaced by m Be . By arguing as in the proof of (A. 4), we can deduce 
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that rhf^ixj) = o p ( n - 2 / 5 ) for all xj G (0,1). The projection of m 5 ' 2 onto H(M) is 
well approximated by the projection onto H(M) with a remainder <5 such that <J(x) = 
o p (n~ 2 / 5 ) for all xG (0,l) d . This proves that m B > 2 (x) = /3(x)?i- 2 / 5 + o p (?i- 2 / 5 ) for all 
xG(0,l) d . 

It thus remains to prove that m B,1 (x) = o p (n -2 ' 5 ) for all x G (0, l) d . For this bound, 
we will show that rh ■ ' (xj) = fij(xj) + o p (n~ 2 ' 5 ), uniformly for all Xj G [0, 1], 1 < j ' < d, 
where \ij (xj ) = aj (xj ) / J K^. (xj , Uj ) duj and aj (xj ) = m'- (xj ) J (uj — Xj)K\ li (xj , Uj ) duj . 
For a proof of this claim, it suffices to show that 



Mj(x) ' [m^'^x) - fi(x)} dx-j = o p (r 



-2/5N 



(A.14) 



uniformly for all Xj G [0, 1], 1 <j < d. Here, ^t(x) = (^i(xi), . . . ,/J,d{xd)) T ■ 
We prove (A.14). Note that, uniformly for Xj G [0, 1], 



M J (x) T /ix(x)dx_ j 



qjiu^Kh^x^u^dUj 



fljiXj) 



^ Hk(%k) qjk{uj,u k )K hj (x J ,u j )K hk (xk,u k )du j du k 






dx k 



+ o p (n- 2 / 5 ) 



■ qj(xj)a,j(xj)+ ^2 ak(xk)qjk(xj,Xk)dxk K hj (xj,Uj)duj + o p (n 2/5 ). 



Claim (A.14) now follows from the fact that 
/'M i (x) T m B ' 1 (x)dx_ J 



-q j (x J )a j (x j )+ ^2 a,k(xk)qjk(xj,Xk)dxk K hj (xj,Uj)duj + Op(n 2/5 ) 



uniformly for Xj G [0, 1]. 



D 



A. 3. Proofs of Theorems 3 and 4 



Recall the definitions of M and M at (4.2) and (4.8), respectively, in the case of local 
polynomial fitting. Let Hj (M) denote the space of (tt + l)d- vectors of functions f = (fj.k) 
in L-2(M) such that /?/(x) = gj^(xj), < £ < tt, for some functions gj,e:M. —> R and 
ffc = (fk,o, ■ ■ ■ , fk,ir) T — for k ^ j. As in the case of local constant fitting, we can write 
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H(M) = Hi(M) H h n d (M). Define H,(M) likewise. The vectors of functions that 

take the roles of q 3 and q 3k , respectively, are 

<S> 3 {x 3 ) = N l E{Z) 2 \X 3 =x 3 )p 3 {x 3 ) 1 

^ jk (x 3 ,x k ) = iiV T E(ZjZ k \X 3 = Xj,X k = Xk)Pjk(xj,Xk), k^j. 

We then have projection formulas analogous to (2.6)-(2.8). For example, for f 6 L2(M.) 
and g <E L 2 (M), we obtain 



(ft J f) j = *,(^)- 1 |M,(x) T f(x)dx_„ 
(n jg ), = * J (^)- 1 |M J (x) T g(x)dx_ J 



and (Iljf)fc = = (Ujg)k for k ^ j, where (Ilff)fc and (Hjg)k denote the fcth (n + 1)- 
vector of the projection of f onto "Hj(M.) and of g onto Hj(M), respectively. We can 
proceed as in the proof of Theorem 1 to prove Theorem 3. 

We prove Theorem 4. Decompose m at (4.3) as rh A + m B , where 

n 

ih A {x) = Mtx)- 1 ?}- 1 J2 v(X\ Z*; x)[Y { - m(X i , Z' i )]/v h (x,X i ). 

i=\ 

va. A and m B , 

equation (4.6). It follows that (Il 3 th A )j(xj) — m A (xj), where 



Define m and m from rh and m , respectively, to be the solutions of the backfitting 

I 3 m A ) 3 {x 3 ) = m A {x 3 ) 



^ A {x 3 )=^ 3 {x 3 )- 1 n- 1 Y,^ 3 (^^))K h] {x 3 ,X))Z}[Y^-m{yi\T)] 



i=i 



As in the proof of Theorem 2, we can prove that rh A {x 3 ) = ri\ A (x 3 ) +o p (n ( 7r + 1 )/( 27r + 3 )) 
for all x£ (0, l) d . The stochastic term m. A (xj) has mean zero and is asymptotically 
normal. Since ^ 3 (x 3 ) = ^ 3 (x 3 ) +o p (l) and 

n 

n-^^varlw,^,^)^^,^)^^!^^ 1 ] 



i=\ 



= N 2 £[ZjV(X, Z)\X 3 = x 3 ]p 3 (x 3 ) + o p (l), 
we find that the asymptotic variance of m. A {x 3 ) equals 

3 K l l >P 3 ix 3 ){E(Z2\X 3 =x 3 W 
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Next, we approximate rh B (x). Define 

1 n 

m S,1 (x) = ^ TT y y M(x)- 1 n- 1 ^v(X\Z";x) 

'xi-xi 



E3H 



j=i 



7T+1 



1 )fe,^J+ 1 



my^'ixj)^ 



/v h (x,X J ) 



and m B:2 (x) = m B (x) — m(x) — m B,1 (x). As in the proof of Theorem 2, we can show 
that mf' 2 (xj) = o p (n-( 7r+1 )/( 27r+3 )) for all Xj 6 (0,1). We compute m B ' x (x). We can 
prove that, for all Xj S (0, 1), 

I'm^dx^- 

jLt/^+i J^ / q j k(x j ,x k )hl +1 m]^ + 1) (x fc )dx fc (A. 15) 

7, T _i_" •/ 



(tt+1)! 



+ / l J +1 7(?J (x J )m^ +1) (x J ) 



+ o p (n-^ +1 )/( 2T + 3 )), 



where g.,- and g^fc are as defined at (A.l) and (A. 2), respectively, and /z,r+i = /i 7r+ i(i ; sr). 
We also have 



/ M i (x) T m B ' 1 (x)dx_ i =/x/x T J^ / q jk {x :j ,x k )\a^ ,l {x k )dx 



fc=i,^" 



+ N l9j (^)mf' 1 (x J ) + o p (?i-^+ 1 )/( 2w + 3 )) 



(A.16) 



for all Xj G (0, 1). Now, we observe that /x T N 1 1 = (1, 0, . . . , 0) since /z is the first column 
of Ni. Thus, 

A*M T N]; 1 7 = m(1, 0, ... ,0)7 = /x^+i- 

Comparing the two systems of equations (A. 15) and (A.16), and by the uniqueness 
of m 5,1 , we conclude that 

mf'Hxj) = (Nr 1 7 )/ l J +1 m^ +1) ( a;j )/(7r + 1)! +o p ( n -^ +1 ^^ + ^) 
for all Xj £ (0, 1), 1 < j < d. This completes the proof of Theorem 4. 
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